Cost Functions

Cost functions

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} L(f(x_i), y_i)$$

where L is the loss function.

Important

Cost vs. Loss:
the loss applies to a single training sample; the cost is the average of the loss over the whole training set.
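To make the distinction concrete, a minimal NumPy sketch (function and variable names are my own, using squared error as the example loss):

```python
import numpy as np

def squared_loss(y_hat, y):
    """Loss: measured per sample (computed elementwise here)."""
    return (y_hat - y) ** 2

def cost(y_hat, y):
    """Cost: the mean of the per-sample losses."""
    return np.mean(squared_loss(y_hat, y))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])
print(squared_loss(y_pred, y_true))  # per-sample losses: [0.25, 0.0, 1.0]
print(cost(y_pred, y_true))          # their mean: ~0.4167
```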

Forms of cost functions

Note that the error can be defined in different ways:

$$\begin{aligned}
\text{Mean Squared Error} &= (x - \hat{x})^2 \\
\text{Absolute Error} &= |x - \hat{x}| \\
\text{Zero-One Loss} &= \begin{cases} 0, & \text{if } x = \hat{x} \\ 1, & \text{otherwise} \end{cases}
\end{aligned}$$
Important

  • Find more types of error in Error Metrics.
  • In ML, Mean Squared Error is commonly used as the cost function, but with an extra division by 2, which "is just meant to make later partial derivation in gradient descent neater":

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (\hat{x}_i - x_i)^2$$
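A short sketch of why the extra 1/2 is convenient (names are my own): in the gradient, the 1/2 cancels the exponent 2.

```python
import numpy as np

def mse_cost(x_hat, x):
    """MSE cost with the extra 1/2 factor: sum((x_hat - x)^2) / (2m)."""
    m = len(x)
    return np.sum((x_hat - x) ** 2) / (2 * m)

def mse_cost_gradient(x_hat, x):
    """Derivative w.r.t. x_hat: the 1/2 cancels the exponent 2."""
    m = len(x)
    return (x_hat - x) / m

x_hat = np.array([2.0, 4.0])
x = np.array([1.0, 2.0])
print(mse_cost(x_hat, x))           # (1 + 4) / (2 * 2) = 1.25
print(mse_cost_gradient(x_hat, x))  # [0.5, 1.0]
```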

Cost function with regularization

When you apply Regularization, a regularization term is added to the cost function to penalize large parameters and avoid overfitting.

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (\hat{x}_i - x_i)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( -y_i \log(f(x_i)) - (1 - y_i)\log(1 - f(x_i)) \right) + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

where j represents the jth feature.
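A sketch of the L2-regularized MSE cost (names are my own; by the usual convention the bias term is excluded from the penalty, matching the sum over the n features):

```python
import numpy as np

def ridge_cost(theta, X, y, lam):
    """MSE cost plus L2 penalty lam/(2m) * sum(theta_j^2).
    theta[0] is treated as the bias and excluded from the penalty."""
    m = len(y)
    residual = X @ theta - y
    penalty = lam / (2 * m) * np.sum(theta[1:] ** 2)
    return np.sum(residual ** 2) / (2 * m) + penalty

X = np.array([[1.0, 1.0], [1.0, 2.0]])  # first column is the bias feature
theta = np.array([0.0, 1.0])
y = np.array([1.0, 2.0])
print(ridge_cost(theta, X, y, lam=2.0))  # residual is 0, so cost = penalty = 0.5
```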

The Hundred-Page Machine Learning Book

  • In practice, L1 regularization produces a sparse model, a model that has most of its parameters equal to zero, provided the hyperparameter C is large enough. So L1 performs feature selection by deciding which features are essential for prediction and which are not. That can be useful in case you want to increase model explainability.
  • However, if your only goal is to maximize the performance of the model on the holdout data, then L2 usually gives better results. L2 also has the advantage of being differentiable, so gradient descent can be used for optimizing the objective function.

Loss and cost for different functions

Loss and cost for linear regression -> Analytic solution

In matrix form, with $\hat{y} = Xw + b$ (the bias $b$ can be absorbed into $w$ by appending a column of ones to $X$), the loss function is

$$\|y - Xw\|^2$$

Setting the derivative of the loss to 0 to achieve the minimum of the loss:

$$\nabla_w \|y - Xw\|^2 = 2X^T(Xw - y) = 0$$

the solution is

$$w = (X^T X)^{-1} X^T y$$

The solution will only be unique when the matrix XTX is invertible, i.e., when the columns of X are linearly independent.
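A quick NumPy check of the analytic solution on synthetic noiseless data (all names and values are illustrative); solving the linear system is numerically safer than explicitly forming the inverse:

```python
import numpy as np

# Synthetic data: random features, known weights, no noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w

# Analytic solution w = (X^T X)^{-1} X^T y, via np.linalg.solve.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # recovers [2.0, -1.0, 0.5] up to floating-point error
```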

Loss and cost for logistic regression

MSE is not suitable here because, combined with the sigmoid output, the resulting cost function would not be convex.

loss function $L(f(x_i), y_i)$:

$$L = \begin{cases} -\log(f(x_i)), & \text{if } y_i = 1 \\ -\log(1 - f(x_i)), & \text{if } y_i = 0 \end{cases}$$

Combining the two cases above, we get the simplified loss function for logistic regression:

$$L = -y_i \log(f(x_i)) - (1 - y_i)\log(1 - f(x_i))$$

then the cost function with full form (also used in Maximum likelihood estimation for logistic regression):

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( -y_i \log(f(x_i)) - (1 - y_i)\log(1 - f(x_i)) \right)$$
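A minimal NumPy sketch of this cost (names are my own; clipping the probabilities is a practical guard against `log(0)`, not part of the formula):

```python
import numpy as np

def logistic_cost(f, y, eps=1e-12):
    """Average logistic (cross-entropy) loss.
    f: predicted probabilities in (0, 1); y: labels in {0, 1}.
    Clipping to [eps, 1 - eps] avoids log(0) for saturated predictions."""
    f = np.clip(f, eps, 1 - eps)
    return np.mean(-y * np.log(f) - (1 - y) * np.log(1 - f))

# Two confident, correct predictions -> small cost (-log 0.9 on each sample).
print(logistic_cost(np.array([0.9, 0.1]), np.array([1, 0])))  # ~0.105
```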

Loss and cost for Softmax

What is Softmax: Artificial Neural Networks#^6ef895

The loss function associated with Softmax, the cross-entropy loss, is:

$$L(\mathbf{a}, y) = \begin{cases} -\log(a_1), & \text{if } y = 1 \\ \quad\vdots \\ -\log(a_N), & \text{if } y = N \end{cases} \tag{3}$$

Only the term that corresponds to the target class contributes to the loss; the others are zero:
$$\mathbf{1}\{y == n\} = \begin{cases}
1, & \text{if } y == n \\
0, & \text{otherwise}
\end{cases}$$
Cost function:

$$J(\mathbf{w}, b) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{N} \mathbf{1}\{y^{(i)} == j\} \log \frac{e^{z_j^{(i)}}}{\sum_{k=1}^{N} e^{z_k^{(i)}}} \right] \tag{4}$$

where m is the number of examples, N is the number of outputs. This is the average of all the losses.
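A sketch of this cost computed directly from logits (names are my own; labels are 0-indexed here, unlike the 1..N indexing in the formula):

```python
import numpy as np

def softmax_cross_entropy_cost(Z, y):
    """Average softmax cross-entropy cost.
    Z: (m, N) logits z; y: (m,) integer class labels in 0..N-1.
    Subtracting the row-wise max keeps exp() numerically stable."""
    Z = Z - Z.max(axis=1, keepdims=True)
    log_probs = Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))
    m = len(y)
    # Only the log-probability of the target class contributes per sample.
    return -np.mean(log_probs[np.arange(m), y])

# Uniform logits over 2 classes -> cost = log(2) (maximum uncertainty).
print(softmax_cross_entropy_cost(np.array([[0.0, 0.0]]), np.array([0])))
```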

Important

Cross-entropy takes the full distribution into account.

Expected loss function

A posterior distribution tells us about the confidence or credibility we assign to different choices. A cost function describes the penalty we incur when choosing an incorrect option. These concepts can be combined into an expected loss function.

Expected loss is defined as:

$$\mathbb{E}[\text{Loss} \mid \hat{x}] = \int L[\hat{x}, x]\, p(x \mid \tilde{x})\, dx$$

where $L[\hat{x}, x]$ is the loss function, $p(x \mid \tilde{x})$ is the posterior, and $\mathbb{E}[\text{Loss} \mid \hat{x}]$ is the expected loss. (When the posterior is discretized, the product inside the integral becomes an elementwise multiplication, i.e., a Hadamard product.)

- The posterior's mean minimizes the mean-squared error.
- The posterior's median minimizes the absolute error.
- The posterior's mode minimizes the zero-one loss.
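These three facts can be checked numerically on a toy discretized posterior (all names and numbers below are made up for illustration):

```python
import numpy as np

# Toy discrete posterior p(x | x~) over candidate values of x.
values = np.array([0.0, 1.0, 2.0, 10.0])
post = np.array([0.45, 0.1, 0.3, 0.15])  # probabilities, sum to 1

def expected_loss(x_hat, loss_fn):
    """E[Loss | x_hat] = sum_x L(x_hat, x) p(x | x~)."""
    return np.sum(loss_fn(x_hat, values) * post)

# Minimize the expected loss over a grid of candidate estimates x_hat.
grid = np.linspace(-1.0, 11.0, 2401)
best_sq = grid[np.argmin([expected_loss(g, lambda a, x: (a - x) ** 2) for g in grid])]
best_abs = grid[np.argmin([expected_loss(g, lambda a, x: np.abs(a - x)) for g in grid])]
best_zo = values[np.argmin([expected_loss(v, lambda a, x: (a != x).astype(float))
                            for v in values])]

print(best_sq)   # posterior mean: sum(values * post) = 2.2
print(best_abs)  # posterior median: 1.0
print(best_zo)   # posterior mode: 0.0
```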

Good Practice in minimizing loss function